TD Networks
نویسندگان
چکیده
We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set at a later time. TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. Using a random-walk example, we show that these networks can be used to learn to predict by a fixed interval, which is not possible with conventional TD methods. Secondly, we show that when actions are introduced, and the inter-prediction relationships made contingent on them, the usual learning-efficiency advantage of TD methods over Monte Carlo (supervised learning) methods becomes particularly pronounced. Thirdly, we demonstrate that TD networks can learn predictive state representations that enable exact solution of a non-Markov problem. A very broad range of inter-predictive temporal relationships can be expressed in these networks. Overall we argue that TD networks represent a substantial extension of the abilities of TD methods and bring us closer to the goal of representing world knowledge in entirely predictive, grounded terms. Temporal-difference (TD) learning is widely used in reinforcement learning methods to learn moment-to-moment predictions of total future reward (value functions). In this setting, TD learning is often simpler and more data-efficient than other methods. But the idea of TD learning can be used more generally than it is in reinforcement learning. TD learning is a general method for learning predictions whenever multiple predictions are made of the same event over time, value functions being just one example. The most pertinent of the more general uses of TD learning have been in learning models of an environment or task domain (Dayan, 1993; Kaelbling, 1993; Sutton, 1995; Sutton, Precup & Singh, 1999). In these works, TD learning is used to predict future values of many observations or state variables of a dynamical system. The essential idea of TD learning can be described as “learning a guess from a guess”. In all previous work, the two guesses involved were predictions of the same quantity at two points in time, for example, of the discounted future reward at successive time steps. In this paper we explore a few of the possibilities that open up when the second guess is allowed to be different from the first. To be more precise, we must make a distinction between the extensive definition of a prediction, expressing its desired relationship to measurable data, and its TD definition, expressing its desired relationship to other predictions. In reinforcement learning, for example, state values are extensively defined as an expectation of the discounted sum of future rewards, while they are TD defined as the solution to the Bellman equation (a relationship to the expectation of the value of successor states, plus the immediate reward). It’s the same prediction, just defined or expressed in different ways. In past work with TD methods, the TD relationship was always among predictions with identical or very similar extensive semantics. In this paper we retain the TD idea of learning predictions based on others, but allow the predictions to have different extensive semantics. 1 The Learning-to-predict Problem The problem we consider in this paper is a general one of learning to predict aspects of the interaction between a decision making agent and its environment. At each of a series of discrete time steps t, the environment generates an observation ot ∈ O, and the agent takes an action at ∈ A. Whereas A is an arbitrary discrete set, we assume without loss of generality that ot can be represented as a vector of bits. The action and observation events occur in sequence, a1, o1, a2, o2, a3 · · ·, with each event of course dependent only on those preceding it. This sequence will be called experience. We are interested in predicting not just each next observation but more general, action-conditional functions of future experience, as discussed in the next section. In this paper we use a random-walk problem with seven states, with left and right actions available in every state: 1 1 1 2 3 4 5 6 7 0 0 0 0 0 The observation upon arriving in a state consists of a special bit that is 1 only at the two ends of the walk plus, in the first two of our three experiments, an explicit indication of the state number. This is a continuing task: reaching an end state does not end or interrupt experience. Although the sequence depends deterministically on action, we assume the actions are selected randomly with equal probability so the overall system can be viewed as a Markov chain. The TD networks introduced in this paper can represent a wide variety of predictions, far more than can be represented by a conventional TD predictor. In this paper we take just a few steps toward more general predictions. In particular, we consider variations of the problem of prediction by a fixed interval. This is one of the simplest cases that cannot otherwise be handled by TD methods. For the seven-state random walk, we will predict the special observation bit some numbers of discrete steps in advance, first unconditionally and then conditioned on action sequences.
منابع مشابه
TD(λ) Networks: Temporal-Difference Networks with Eligibility Traces
Temporal-difference (TD) networks have been introduced as a formalism for expressing and learning grounded world knowledge in a predictive form (Sutton & Tanner, 2005). Like conventional TD(0) methods, the learning algorithm for TD networks uses 1-step backups to train prediction units about future events. In conventional TD learning, the TD(λ) algorithm is often used to do more general multi-s...
متن کاملPOMDP 環境中での TD - Network の自動獲得 : 単純再帰構造による拡張 Automatic Acquisition of TD - Network in POMDP Environments : Extension with SRN structure 牧野貴樹
We propose a new neural network architecture, Simple recurrent TD Networks (SR-TDNs), that learns to predict future observations in partially observable environments, using proto-predictive representation of states. SR-TDNs incorporate the structure of simple recurrent neural networks (SRNs) into temporal-difference (TD) networks to use proto-predictive representation of states. Our simulation ...
متن کاملTemporal-Difference Networks with History
Temporal-difference (TD) networks are a formalism for expressing and learning grounded world knowledge in a predictive form [Sutton and Tanner, 2005]. However, not all partially observable Markov decision processes can be efficiently learned with TD networks. In this paper, we extend TD networks by allowing the network-update process (answer network) to depend on the recent history of previous ...
متن کاملDirect Code Access in Self-Organizing Neural Networks for Reinforcement Learning
TD-FALCON is a self-organizing neural network that incorporates Temporal Difference (TD) methods for reinforcement learning. Despite the advantages of fast and stable learning, TD-FALCON still relies on an iterative process to evaluate each available action in a decision cycle. To remove this deficiency, this paper presents a direct code access procedure whereby TD-FALCON conducts instantaneous...
متن کاملUplink Scheduling and Adjacent-Channel Coupling Loss Analysis for TD-LTE Deployment
TD-LTE, one of the two duplexing modes in LTE, operates in unpaired spectrum and has the advantages of TDD-based technologies. It is expected that TD-LTE will be more rapidly deployed in near future and most of WiMax operators will upgrade their networks to TD-LTE gradually. Before completely upgrading to TD-LTE, WiMax may coexist with TD-LTE in an adjacent frequency band. In addition, multiple...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2004